[Dev] Feature: linear cross entropy fusion #2256
Conversation
Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

init fused linear cross-entropy interface

* add forward-mainloop and bwd_partial_dlogits kernel
* skip TestFusedLinearCrossEntropyOnGptModel for single GPU
* added unit-test for linear_cross_entropy on DP
* added unit-test for TP
* add sequence-parallel and its unit-test
* fix weight-is-None issue and make the API compatible
* fix fused linear-CE fusion loss issue
* fix typo in fused_linear_ce triton
* add sequence_parallel option on compute_language_model_loss_without_logits
* linear cross-entropy fusion is not used by default
* remove redundant logits calculations in gpt_model
* merge the linear-cross-entropy-fusion flag and the cross-entropy-fusion flag
* rename compute_output_layer_and_language_model_loss
* remove unused option fused_linear_cross_entropy in transformer_config
For convergence test results, please refer to: #2206 (comment)

Linking: #2206
```python
logits, _ = self.output_layer(
    hidden_states, weight=output_weight, runtime_gather_output=runtime_gather_output
)
if has_config_logger_enabled(self.config) or labels is None:
```
Could you elaborate on the meaning of the clause `has_config_logger_enabled(self.config)`?
As shown in this line of the function, enabling logging causes the logits to be dumped to disk. That is why this flag is used here to decide whether the logits need to be generated.
This needs to be handled carefully, in case some code path needs the logits while they have not been computed.
The logic here feels a bit messy. Can we clean it up to make it more straightforward? For example, when logits logging is enabled, we could assert that this optimization is not allowed to be turned on.
The logic here is that logits are calculated when some circumstance requires them. This is orthogonal to enabling the optimization.
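The control flow being discussed might be sketched as follows (a minimal illustration; the function name, parameters, and callables are hypothetical stand-ins, not the actual Megatron-LM API):

```python
def forward_step(hidden_states, labels, config, *, output_layer, loss_fn,
                 fused_linear_ce, has_config_logger_enabled):
    """Materialize logits only when something downstream needs them."""
    need_logits = has_config_logger_enabled(config) or labels is None
    if need_logits:
        # full [b, s, v] logits tensor materialized (logging or inference path)
        logits = output_layer(hidden_states)
        return logits if labels is None else loss_fn(logits, labels)
    # training fast path: fused kernel, logits never materialized
    return fused_linear_ce(hidden_states, labels)
```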
* fix platform failure in test env
* fix import error in test env without CUDA & CUTE
* Revert "fix import error in no CUDA & CUTE test env" (reverts commit 0b8010b)
* safe_imports check skips Blackwell
* try clean up
* reduce fused_linear_cross_entropy UT problem size to avoid OOM
* skip UT when device arch is not 10
* fix mamba logits compute order
* add Copyright for init.py; allow files under Blackwell to bypass import checks
/ok to test fb2ee78
* [DEV] pull main Nov 25 (NVIDIA#2395)
* adding action for checking whether PR author is an NVIDIA employee or not, for selecting ephemeral CI hosts (NVIDIA#2402)
* fix: exit failure when PR author is external contributor removed (NVIDIA#2410)
* fix: adding k8s taints for ephemeral jobs (NVIDIA#2420)
* ci: Enable functional tests (NVIDIA#2419)
* Reapply "build: Upgrade deps (NVIDIA#2289)" (NVIDIA#2408)
* fix: use a script to do node tainting in the cicd workflow (NVIDIA#2421)
* Revert "[DEV] pull main Nov 25 (NVIDIA#2395)" (reverts commit 56682f8)
* [Dev] Support packed seq in MTP (NVIDIA#2043)
* Fix runaway Etpt in straggler detector by resetting FLOPs accumulator (NVIDIA#2128)
* [Dev] feat(MoE): Refactor cuda_graph_scope - part2 (NVIDIA#2353)
* [dev] DeepSeek V3.2 support (NVIDIA#2154)
* Revert "[Dev] feat(MoE): Refactor cuda_graph_scope - part2 (NVIDIA#2353)" (reverts commit 92c8482)
* Add logs for missing CUDA and Cute
* autoformat
/ok to test 4dc347c
/ok to test ae4d83a
Thank you for your contribution! NVIDIA Megatron-LM is currently transitioning to development on GitHub. We will aim to review your PR after we complete our transition and stabilize our GitHub development process. Thank you for your understanding.

What does this PR do?
This PR introduces an implementation that fuses the linear layer of `lm_head` and cross-entropy, in order to avoid materializing the intermediate logits tensor, helping to reduce the memory footprint.
PR to the main branch: #2206
Details about this feature
Training an LLM typically involves a two-stage pipeline at the output layer: hidden states are projected into vocabulary logits via a linear transformation (the `lm_head` layer), followed by a cross-entropy loss computation against the target tokens. While conceptually simple, this workflow incurs substantial overhead. The intermediate logits tensor, with dimensions proportional to batch size, sequence length, and vocabulary size, must be fully materialized in GPU memory, even though only one target token per position is ultimately used. This leads to a significant memory footprint and bandwidth consumption, limiting scalability and slowing training throughput.

On top of the local logits tensor, other techniques may need further intermediate buffers for collecting full information across all GPUs, e.g. a TP-compatible layer composed of torch-native ops.

By fusing `Linear` and `Cross-Entropy` into one single operation, this PR avoids materializing the intermediate logits tensor, which helps reduce the memory footprint by at least `2 * b * s * v` elements:

* the logit tensor, whose shape is `[batch, seqlen, vocabsize]`
* the grad of the logit tensor, whose shape is also `[batch, seqlen, vocabsize]`

Functionalities

* Accepts inputs in `BF16` or `FP16` format, and conducts accumulation and other logic in `FP32` format, avoiding precision problems.
* Supports `Data-Parallel`, `Tensor-Parallel` along vocabsize, and `Sequence-Parallel` along seqlen:
  * when `tp_group is None`, it works in DP mode
  * when `tp_group is not None` and `sequence_parallel is False`, it works in TP mode
  * when `tp_group is not None` and `sequence_parallel is True`, it works in SP mode
* Supports `ignore_index` as native torch cross-entropy does.
* Supports the `reduction` method as native torch cross-entropy does.
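The unfused two-stage workflow described above can be sketched in plain PyTorch (a minimal illustration with made-up names and shapes, not the actual Megatron-LM code):

```python
import torch
import torch.nn.functional as F

def unfused_lm_loss(hidden, weight, labels):
    """Baseline pipeline: project hidden states to vocab logits, then apply
    cross-entropy. hidden: [b, s, h], weight: [v, h], labels: [b, s].
    The full [b, s, v] logits tensor is materialized here -- this is the
    memory cost the fusion removes."""
    logits = hidden @ weight.t()                       # [b, s, v], fully materialized
    return F.cross_entropy(
        logits.float().reshape(-1, logits.size(-1)),   # accumulate in FP32
        labels.reshape(-1),
        reduction="mean",
    )
```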
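The TP-compatible layer mentioned above needs extra buffers to combine information across GPUs. A single-process sketch of vocab-parallel cross-entropy (plain sums stand in for the all-reduces over `tp_group`; all names are illustrative, not the PR's code):

```python
import torch

def vocab_parallel_ce(hidden, weight_shards, labels):
    """Each shard holds a [v/tp, h] slice of the output weight. A real TP
    layer combines the per-shard max / sum-exp / target logit with
    all-reduce(MAX) and all-reduce(SUM); here all shards live in one
    process, so local reductions stand in for the collectives."""
    flat_h = hidden.reshape(-1, hidden.size(-1)).float()
    flat_y = labels.reshape(-1)
    # per-shard partial logits: the intermediate buffers the unfused path keeps
    partials = [flat_h @ w.t().float() for w in weight_shards]
    # stand-in for all-reduce(MAX) of per-shard row maxima
    row_max = torch.stack([p.max(dim=-1).values for p in partials]).max(dim=0).values
    # stand-in for all-reduce(SUM) of per-shard sum-exp
    sum_exp = sum((p - row_max[:, None]).exp().sum(dim=-1) for p in partials)
    # fetch the target logit from whichever shard owns that vocab index
    offset, tgt = 0, torch.zeros_like(flat_y, dtype=torch.float)
    for p in partials:
        v = p.size(-1)
        mask = (flat_y >= offset) & (flat_y < offset + v)
        tgt[mask] = p[mask, flat_y[mask] - offset]
        offset += v
    # mean NLL: logsumexp(logits) - target_logit
    return (row_max + sum_exp.log() - tgt).mean()
```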
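The actual fusion in this PR is a GPU kernel (the commits mention Triton and the CUTLASS DSL). As a conceptual stand-in for its memory behavior only, the loss can be computed chunk by chunk so that a full `[b, s, v]` logits tensor never exists:

```python
import torch

def chunked_linear_ce(hidden, weight, labels, chunk=1024):
    """Conceptual sketch: only one [chunk, v] logits slice is alive at a
    time, instead of the full [b*s, v] tensor. (The PR does the analogous
    thing inside a single fused kernel.)"""
    flat_h = hidden.reshape(-1, hidden.size(-1))
    flat_y = labels.reshape(-1)
    total = hidden.new_zeros((), dtype=torch.float32)   # FP32 accumulator
    for i in range(0, flat_h.size(0), chunk):
        logits = (flat_h[i:i + chunk] @ weight.t()).float()  # small slice only
        total = total + torch.nn.functional.cross_entropy(
            logits, flat_y[i:i + chunk], reduction="sum")
    return total / flat_y.numel()
```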
Performance and Storage

In DP mode, this PR could lead to a perf boost and storage reduction in the following config:
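As a rough sense of the scale involved (illustrative sizes, not the benchmark config): in BF16, the logits tensor and its gradient together occupy `2 * b * s * v * 2` bytes:

```python
# Back-of-envelope logits memory; batch/seqlen/vocab are assumed example
# values, not the PR's benchmark configuration.
batch, seqlen, vocab = 4, 4096, 131072
bytes_per_elem = 2                        # BF16 / FP16
logits_gib = batch * seqlen * vocab * bytes_per_elem / 2**30
print(f"logits: {logits_gib:.1f} GiB; logits + grad: {2 * logits_gib:.1f} GiB")
# → logits: 4.0 GiB; logits + grad: 8.0 GiB
```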

You may try the following steps to reproduce it:
```shell
# start a Megatron image on GB200
$ pip install nvidia-cutlass-dsl==4.2.1
$ pip install PyGithub
$ pytest -s -v tests/unit_tests/fusions/test_fused_linear_cross_entropy.py
$ torchrun --nproc_per_node=4 --nnodes=1 -m pytest -s -v tests/unit_tests/fusions/test_fused_linear_cross_entropy.py
```